Keep variable1 if 1 observation per year OR*the minimum number for a variable2 if > 1 observations per year

Kevin Marks

Join Date: Jun 2021

Posts: 24
#1

Keep variable1 if 1 observation per year OR*the minimum number for a variable2 if > 1 observations per year

24 Oct 2021, 13:08

Need help! My goal is to write code to keep a single HbA1c value per year closest to each participant’s birthday.. if there is > 1 HbA1c values done each year. I am working with a time series and longitudinal data set and my table is in long format. My upper column variables are ID_number, HbA1c_test_date, HbA1c_mmolmol, y_status_date (which is the year the HbA1c test was done), diff_days (which is the # of days between the participant’s birthday and 1, 2 or 3 different HbA1c values that were obtained on a given year from 2009 to 2020. I am stuck on my last command. My last command is supposed to only only keep the HbA1c value with minimum # of days (between the birthday and test date) when there are 2 or more HbA1c observations in a given year.

*browse by columns of ID number, year of the HbA1c test and # of days difference between the HbA1c test date and the participant’s birthday.
br ID_number y_status_date new_diff_days

by ID_number y_status_date: generate measure_incl = _n
summarize measure_incl
summarize measure_incl if measure_incl==3
summarize measure_incl if measure_incl==2
summarize measure_incl if measure_incl==1
*Here, I just noted the # of observations when a study participant had 1, 2 or 3 HbA1c values done per year

sort cpr_number new_diff_days

*This command below is not working.
keep if measure_incl == 1 | measure_incl > 1, min(new_diff_days)
*invalid syntax
*r(198)

Any sage advice for me? Again, my goal is to only keep the HbA1c values where there is just 1 per year OR the HbA1c value with the lowest number of days (new_diff_days) between the date of the HbA1c test and the participant’s birthday.
Tags: data management, longitudinal data, panel data, Time Series
Clyde Schechter

Join Date: Apr 2014

Posts: 30085
#2

24 Oct 2021, 13:36

Based on your data description, it sounds like you have one observation for each combination of ID_number, year, and diff_days. That is, there are never two HbA1c observations for the same person on the same date.

It is not necessary to treat the "only one observation" and "more than one observation" cases separately: if there is only one HbA1c, then necessarily it is the one closest to the birthday. So all you need is:

Code:

by ID_number year (diff_days), sort: keep if _n == 1

If this does not do what you want, it is because your data are not as I imagine them to be. That is the drawback to imaginary data. So for better help, post back with example data, using the -dataex- command. Or perhaps I have misunderstood what you want--so also point out clearly how what my code does differs from what you need.

If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

Last edited by Clyde Schechter; 24 Oct 2021, 13:38.
1 like
Comment
Kevin Marks

Join Date: Jun 2021

Posts: 24
#3

24 Oct 2021, 13:54

Thank you Clyde. Your code worked just fine. Will think about using -dataex- in the future.
Comment

Announcement

Keep variable1 if 1 observation per year OR*the minimum number for a variable2 if > 1 observations per year

Comment

Comment